1 Motivation

In many naturally occurring optimization problems one needs to ensure that the problem is defined so that its
solutions are tractable to compute. In cases where exact solutions cannot be computed tractably, it
is beneficial to have strong guarantees on the tractable approximate solutions. To operate under these criteria,
most optimization problems are cast under the umbrella of convexity or submodularity. In this report we will study
design and optimization over a common class of functions called submodular functions.

Set functions, and specifically submodular set functions, characterize a wide variety of naturally occurring
optimization problems, and the property of submodularity of set functions has deep theoretical consequences with
wide-ranging applications. Informally, submodularity captures the intuitive principle of diminishing returns:
adding an element to a smaller set has at least as much value as adding it to a larger set.
Common examples of submodular monotone functions are entropies, concave functions of cardinality, and matroid 
rank functions; non-monotone examples include graph cuts, network flows, and mutual information. 

In this paper we will review the formal definition of submodularity; the optimization of submodular functions, both 
maximization and minimization; and finally discuss some applications in relation to learning and reasoning using 
submodular functions. 

2 What is Submodularity: Formal Definition 

We define submodularity as a property of set functions $f : 2^V \to \mathbb{R}$, which assign to each subset
$S \subseteq V$ a value $f(S)$. Here $V$ is a finite set called the ground set. We also assume that $f(\emptyset) = 0$.

Definition 1 A set function $f : 2^V \to \mathbb{R}$ is called submodular if it satisfies

$$f(X) + f(Y) \ge f(X \cup Y) + f(X \cap Y) \quad \forall X, Y \subseteq V$$

The function $f$ takes different forms in different application domains. In a machine learning context $f$ could
be a function that evaluates the information content of a given set, e.g. entropy. Using this notion, we can easily
introduce the property of diminishing returns through an equivalent definition of submodularity.

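Definition 2 A set function $f : 2^V \to \mathbb{R}$ is called submodular if, for every $k \in V$, the marginal gain
$X \mapsto f(X \cup \{k\}) - f(X)$ is non-increasing, i.e.

$$f(X \cup \{k\}) - f(X) \ge f(Y \cup \{k\}) - f(Y) \quad \forall X \subseteq Y \subseteq V, \; k \in V \setminus Y$$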

Finally, a set function $f$ is called supermodular if $-f$ is submodular, and if $f$ is both sub- and supermodular
then it is called a modular function.
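Both definitions can be checked by brute force on a small ground set. The following is a minimal sketch, not part of the original development: the helper names `is_submodular` and `has_diminishing_returns`, the coverage example `AREAS`, and the counterexample `squared_size` are assumptions introduced purely for illustration.

```python
from itertools import combinations


def powerset(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_submodular(f, ground, tol=1e-12):
    """Definition 1: f(X) + f(Y) >= f(X | Y) + f(X & Y) for all X, Y."""
    subsets = list(powerset(ground))
    return all(
        f(X) + f(Y) + tol >= f(X | Y) + f(X & Y)
        for X in subsets
        for Y in subsets
    )


def has_diminishing_returns(f, ground, tol=1e-12):
    """Definition 2: the gain of adding k shrinks as the base set grows."""
    subsets = list(powerset(ground))
    for Y in subsets:
        for X in subsets:
            if not X <= Y:
                continue
            for k in set(ground) - Y:
                gain_small = f(X | {k}) - f(X)
                gain_large = f(Y | {k}) - f(Y)
                if gain_small + tol < gain_large:
                    return False
    return True


# Hypothetical example data: each item covers a few numbered locations.
AREAS = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}


def coverage(X):
    """Number of locations covered by the items in X (a submodular function)."""
    covered = set()
    for item in X:
        covered |= AREAS[item]
    return len(covered)


def squared_size(X):
    """|X|^2 has increasing returns, so it is not submodular."""
    return len(X) ** 2


print(is_submodular(coverage, AREAS), has_diminishing_returns(coverage, AREAS))
print(is_submodular(squared_size, AREAS), has_diminishing_returns(squared_size, AREAS))
```

The check enumerates all pairs of subsets, so it is only practical for ground sets of a handful of elements; it is meant as a sanity test, not as an algorithm.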

2.1 Notation I 

In this section we introduce notation that we will maintain throughout this document, unless specified otherwise.
The ground set over which the submodular functions are defined will be denoted by $V$, with cardinality $n$. For a
vector $x \in \mathbb{R}^V$ and a subset $Y \subseteq V$ we define $x(Y) = \sum_{u \in Y} x(u)$. We can naturally
extend this definition to capture the positive and negative parts of the vector $x$ as $x^+ \in \mathbb{R}^V$ and
$x^- \in \mathbb{R}^V$, where $x^+(u) = \max\{x(u), 0\}$ and $x^-(u) = \min\{x(u), 0\}$. For a submodular function $f$
we define a polyhedral convex set $P(f)$ called the submodular polyhedron:

$$P(f) = \{x \in \mathbb{R}^V \mid x(S) \le f(S) \;\; \forall S \subseteq V\}$$

The face of $P(f)$ on which $x(V) = f(V)$ defines the base polyhedron:

$$B(f) = \{x \in P(f) \mid x(V) = f(V)\}$$

The elements of $B(f)$ are called bases of the set $V$ or of the polyhedron $P(f)$.

2.2 Properties of Submodular Functions 

The basic properties of submodular functions are enumerated below. These properties will help us recast many of our 
optimization objectives as submodular optimization problems. 


• Lemma 2.1 (Closedness Properties): Submodular functions are closed under nonnegative linear combinations,
i.e. if $\{f_1, f_2, \ldots, f_k\}$ are submodular then the function $g(X) = \sum_{i=1}^{k} \alpha_i f_i(X)$ is
submodular for all $\alpha_i \ge 0$.

Corollary 1.1: The sum of a modular and a submodular function is a submodular function.

Corollary 1.2 (Restriction / marginalization): If $Y \subseteq V$, then $X \mapsto f(X \cap Y)$ is submodular on $V$
and on $Y$.

Corollary 1.3 (Contraction / conditioning): If $X \subseteq Y$ and $f$ is submodular, then $g(X) = f(Y \setminus X)$ is
submodular. Similarly, if $Y \subseteq V$, then $X \mapsto f(X \cup Y) - f(Y)$ is submodular on $V$ and on $V \setminus Y$.

• Lemma 2.2 (Partial Minimization): Monotone submodular functions remain submodular under truncation,
i.e. if $f$ is monotone submodular then $g(X) := \min\{f(X), c\}$ is submodular for any constant $c$.

Note: This property is not necessarily preserved for the pointwise max or min of two submodular functions.

• Lemma 2.3 (Cardinality Based Functions): If $f$ is a monotone submodular function, for example the cardinality
$f(X) = |X|$, then $g(X) = \phi(f(X))$ is also submodular whenever $\phi(\cdot)$ is a non-decreasing concave function.

• Lemma 2.4 (Lovász Extension): A function $f$ is submodular iff its Lovász extension $\hat{f}$ is convex,
where, for $c \in [0,1]^n$,

$$\hat{f}(c) = \max\{c^T x \mid x(U) \le f(U) \;\; \forall U \subseteq V\}$$
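For a submodular $f$ with $f(\emptyset) = 0$ and a non-negative weight vector $c$, the maximum in Lemma 2.4 is attained by a greedy ordering of the items, so the Lovász extension can be evaluated by sorting rather than by solving a linear program. The sketch below assumes that setting; the names `lovasz_extension` and `truncated_cardinality` and the example points are illustrative choices, not taken from the original text.

```python
def lovasz_extension(f, ground, c):
    """Evaluate the Lovasz extension of f at c (a dict mapping item -> weight).

    Items are visited in order of decreasing weight; each contributes its
    weight times the marginal gain of adding it to the items seen so far.
    """
    order = sorted(ground, key=lambda item: c[item], reverse=True)
    value, prefix, prev = 0.0, set(), f(frozenset())
    for item in order:
        prefix.add(item)
        current = f(frozenset(prefix))
        value += c[item] * (current - prev)
        prev = current
    return value


def truncated_cardinality(X):
    """Example submodular function: min(|X|, 2)."""
    return min(len(X), 2)


GROUND = ["a", "b", "c"]

# On the indicator vector of a set, the extension agrees with f on that set.
indicator = {"a": 1.0, "b": 1.0, "c": 0.0}
print(lovasz_extension(truncated_cardinality, GROUND, indicator))

# Midpoint convexity check at two arbitrary points of [0, 1]^3.
c1 = {"a": 1.0, "b": 0.0, "c": 0.2}
c2 = {"a": 0.0, "b": 1.0, "c": 0.6}
mid = {u: 0.5 * (c1[u] + c2[u]) for u in GROUND}
lhs = lovasz_extension(truncated_cardinality, GROUND, mid)
rhs = 0.5 * (
    lovasz_extension(truncated_cardinality, GROUND, c1)
    + lovasz_extension(truncated_cardinality, GROUND, c2)
)
print(lhs <= rhs + 1e-12)
```

A single midpoint check does not prove convexity; it only illustrates the inequality that Lemma 2.4 guarantees for every pair of points.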

3 Submodular Optimization 

Submodular functions have many interesting connections with convex and concave functions, as demonstrated by
Lemmas 2.3 and 2.4. Just as minimization of convex functions can be done efficiently, unconstrained submodular
minimization is also possible in strongly polynomial time. Submodular function maximization, in contrast, is an
NP-hard combinatorial optimization problem, but approximate solutions can be found with guarantees. In fact a simple
greedy method obtains a $(1 - 1/e)$ approximation when we maximize a non-decreasing submodular function under a
cardinality (uniform matroid) constraint.

3.1 Submodular Function Minimization 

Submodular function minimization algorithms can be divided into two categories, exact and approximate. Exact
algorithms obtain global minimizers, whereas approximate algorithms only return a set $X$ with
$f(X) - \min_{Y \subseteq V} f(Y) \le \varepsilon$, where $\varepsilon$ is as small as possible. If $\varepsilon$ is
less than the minimum absolute difference between non-equal values of $f$, the computed solution is an exact minimizer.

An important practical aspect of submodular function minimization is that most algorithms come with online
approximation guarantees due to a duality relationship, which we detail in the following subsections.


3.1.1 Submodular Function Minimizers 

For the lemmas stated below, we consider $f : 2^V \to \mathbb{R}$ to be a submodular function with $f(\emptyset) = 0$.


• Lemma 3.1 (Lattice of minimizers for submodular functions): The set of minimizers of $f$ is a lattice, i.e. if
$X$ and $Y$ are minimizers then $X \cup Y$ and $X \cap Y$ are also minimizers. This is evident from Definition 1 and
Lemma 1.1.

• Lemma 3.2 (Diminishing return property of minimizers of submodular functions): The set $X \subseteq V$ is a
minimizer of $f$ on $2^V$ iff $X$ is a minimizer of the function $2^X \to \mathbb{R}$ defined as
$Y \subseteq X \mapsto f(Y)$, and $\emptyset$ is a minimizer of the function $2^{V \setminus X} \to \mathbb{R}$ defined
as $Y \subseteq V \setminus X \mapsto f(Y \cup X) - f(X)$. This can be easily shown from Definition 1.


Corollary 6.1 (Norm Characterization): Suppose $x$ is a minimizer of

$$\min_{x} \|x\|_2 \quad \text{subject to } x \in B(f)$$

Then a minimizer $A$ of $f$ can be obtained as follows:

$$A = \{u \in V \mid x(u) < 0\}$$


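• Lemma 3.3 (Dual of minimization of submodular functions):

$$\min_{X \subseteq V} f(X) = \max_{x \in B(f)} x^-(V) = \frac{1}{2}\Big(f(V) - \min_{x \in B(f)} \|x\|_1\Big)$$

As mentioned in Section 2.1, $(x^-)_k = \min\{x_k, 0\}$ for all $k \in V$, so $x^-(V) = \frac{1}{2}(x(V) - \|x\|_1)$, and
$x(V) = f(V)$ on $B(f)$. If $X \subseteq V$ and $x \in B(f)$, we have $f(X) \ge x^-(V)$, with equality iff
$\{x < 0\} \subseteq X \subseteq \{x \le 0\}$ and $x(X) = f(X)$.

$$\min_{X \subseteq V} f(X) = \max_{x \in P(f),\, x \le 0} x(V)$$

Again, if $X \subseteq V$ and $x \in P(f)$ with $x \le 0$, then $f(X) \ge x(V)$, with equality iff
$\{x < 0\} \subseteq X$ and $x(X) = f(X)$.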


3.1.2 Minimum Norm Point Algorithm 

As an example of submodular function minimization we present the minimum norm point algorithm. This non-combinatorial
approach, proposed by Fujishige [1], is based on the norm characterization of the minimizers of $f$ shown in
Corollary 6.1. Fujishige uses Wolfe's algorithm [2], which was developed to find the minimum $L_2$-norm point in the
convex hull of a finite set of points $P \subset \mathbb{R}^n$. This method maintains the vector $x$ as a convex
combination of a set of points $S \subseteq P$ and iterates over the following steps:

1. A new point from $P$, chosen to minimize the inner product with $x$, is added to the set $S$.

2. The point of minimum norm in the affine hull of $S$ is computed.

3. The minimum norm point is projected onto the convex hull of $S$.

In the case of submodular functions, $P$ is the set of all extreme bases, which is exponential in size, so it cannot
be searched exhaustively. This issue is circumvented by using Edmonds' greedy algorithm [3].
Algorithm 1 Minimum Norm Point Algorithm

Initialization: $x \leftarrow$ extreme base generated using an arbitrary ordering, $S \leftarrow \{x\}$
loop
    Selection of a new base using Edmonds' greedy algorithm: $y' \leftarrow \arg\min_{y \in B(f)} x^T y$
    if $x^T y' = x^T x$ then
        return $x$
    else
        $S \leftarrow S \cup \{y'\}$
        Minimization over the affine hull of $S$: $z \leftarrow \arg\min_{y \in \mathrm{aff}(S)} \|y\|_2$
        Projection onto the convex hull of $S$:
        while $z \notin \mathrm{relint}(\mathrm{conv}(S))$ do
            $z \leftarrow$ intersection of the segment $[z, x]$ with the boundary of $\mathrm{conv}(S)$
            $S \leftarrow$ face of $\mathrm{conv}(S)$ intersected by $[z, x]$
        $x \leftarrow z$
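The base-selection step relies on the fact that a linear function can be minimized over $B(f)$ without enumerating its exponentially many extreme points: sort the items by increasing weight and assign each item its marginal gain along that ordering. The sketch below illustrates only this greedy step, not the full minimum norm point algorithm; the function names, the example function `f` and the weight vector `x` are assumptions made for the illustration.

```python
from itertools import combinations, permutations


def base_from_order(f, order):
    """Extreme base of B(f) generated by visiting items in the given order."""
    y, prefix, prev = {}, set(), f(frozenset())
    for u in order:
        prefix.add(u)
        current = f(frozenset(prefix))
        y[u] = current - prev  # marginal gain of u given the earlier items
        prev = current
    return y


def greedy_base(f, ground, x):
    """Edmonds-style greedy step: the base y minimizing sum(x[u] * y[u])."""
    return base_from_order(f, sorted(ground, key=lambda u: x[u]))


def dot(x, y):
    return sum(x[u] * y[u] for u in x)


# Hypothetical submodular function: concave-of-cardinality plus a modular term.
WEIGHTS = {"a": 0.3, "b": 1.2, "c": 0.4}
GROUND = list(WEIGHTS)


def f(X):
    return min(len(X), 2) - sum(WEIGHTS[u] for u in X)


x = {"a": 0.5, "b": -1.0, "c": 0.1}
y = greedy_base(f, GROUND, x)

# y lies in B(f): y(S) <= f(S) for every S, with equality for S = V.
in_polyhedron = all(
    sum(y[u] for u in S) <= f(frozenset(S)) + 1e-9
    for r in range(len(GROUND) + 1)
    for S in combinations(GROUND, r)
)
tight_on_v = abs(sum(y.values()) - f(frozenset(GROUND))) < 1e-9

# Brute-force comparison against every extreme base (one per ordering).
best = min(dot(x, base_from_order(f, order)) for order in permutations(GROUND))
print(in_polyhedron, tight_on_v, abs(dot(x, y) - best) < 1e-9)
```

Each call costs one sort and $n$ evaluations of $f$, which is what makes the selection step practical inside the loop of Algorithm 1.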


3.2 Submodular Function Maximization 

Problems of the form $\max_{X \subseteq V} f(X)$ for a submodular function $f$ occur in various applications. Problems
of this kind are known to be NP-hard. Feige and Mirrokni [4] showed that, when maximizing non-negative submodular
functions, a random subset achieves at least 1/4 of the optimal value, and local search techniques achieve at least
1/3 in general and 1/2 for symmetric functions. Though these problems are NP-hard, a $(1 - 1/e)$ approximation can be
obtained when maximizing a non-decreasing submodular function under matroid constraints. The solution for arbitrary
matroid constraints was shown more recently by Vondrák et al. [5]. The initial $(1 - 1/e)$ approximation result was
shown by Nemhauser [6] in the 1970s, but it was only applicable to uniform matroid (cardinality) constraints. The
solution for the uniform matroid constraint consists of a simple greedy algorithm that has implications in online
learning and adaptive submodularity.

• Lemma 3.4 (Local maxima for submodular function maximization): Given a submodular function
$f : 2^V \to \mathbb{R}$ with $f(\emptyset) = 0$ and a set $X \subseteq V$ such that $f(X \setminus \{k\}) \le f(X)$ for
all $k \in X$ and $f(X \cup \{k\}) \le f(X)$ for all $k \in V \setminus X$, then $f(Y) \le f(X)$ for all
$Y \subseteq X$ and all $Y \supseteq X$.

3.2.1 Greedy Algorithm for Monotone Submodular Function Maximization with Uniform Matroid 
Constraints 

Maximization under arbitrary matroid constraints can be achieved using Vondrák's algorithm. In this document we focus
on the greedy algorithm, as it has implications in the online learning domain. Note: maximization can also be
formulated using the base polyhedron, given $f$ and its Lovász extension $\hat{f}$. In this case maximization is
equivalent to finding the maximum $\ell_1$-norm point in the base polyhedron. See [7] for more details. For monotone
submodular maximization subject to a uniform matroid constraint, we need to find a set $X^* \subseteq V$ such that

$$X^* = \arg\max_{|X| \le n} f(X)$$

where $n$ is the cardinality (uniform matroid) constraint. Though this problem is NP-hard, we can obtain a solution
whose value is at least $(1 - 1/e)$ times the optimal value. The algorithm for obtaining this solution is shown
in Algorithm 2.


Algorithm 2 Greedy Algorithm

Initialization: start with $X = \emptyset$
for $i = 1$ to $n$ do
    $y' := \arg\max_{y \in V \setminus X} f(X \cup \{y\})$
    $X := X \cup \{y'\}$
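The greedy rule above can be sketched in a few lines. The example is a maximum coverage instance, a standard monotone submodular objective; the data in `SETS`, the helper names, and the brute-force comparison are assumptions added for illustration and are not part of the original algorithm statement.

```python
import math
from itertools import combinations

# Hypothetical instance: each candidate set covers some numbered elements.
SETS = {
    "s1": {1, 2, 3, 4},
    "s2": {3, 4, 5},
    "s3": {5, 6},
    "s4": {1, 6, 7},
}


def coverage(X):
    """Monotone submodular objective: number of elements covered by X."""
    covered = set()
    for name in X:
        covered |= SETS[name]
    return len(covered)


def greedy_maximize(f, ground, budget):
    """Repeatedly add the item that gives the largest value of f(X + item)."""
    X = frozenset()
    for _ in range(budget):
        candidates = [y for y in ground if y not in X]
        if not candidates:
            break
        best = max(candidates, key=lambda y: f(X | {y}))
        X = X | {best}
    return X


def brute_force_optimum(f, ground, budget):
    """Exact optimum by enumeration; only feasible for tiny ground sets."""
    return max(
        f(frozenset(c))
        for r in range(budget + 1)
        for c in combinations(ground, r)
    )


chosen = greedy_maximize(coverage, list(SETS), budget=2)
optimum = brute_force_optimum(coverage, list(SETS), budget=2)
print(sorted(chosen), coverage(chosen), optimum)
print(coverage(chosen) >= (1 - 1 / math.e) * optimum)
```

The last line checks the $(1 - 1/e)$ guarantee on this one instance; the guarantee itself holds for every monotone submodular $f$ with $f(\emptyset) = 0$ under a cardinality constraint.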
3.3 Adaptive Submodularity


The process of adaptively making decisions with uncertain outcomes is fundamental to many problems with partial 
observability. In such situations the decision maker needs to make a sequence of decisions by accounting for past 
observations and adapting accordingly. It has been shown by Golovin and Krause [8] that if a problem is adaptively 
submodular, then an adaptive greedy algorithm is guaranteed to obtain near optimal solutions. For clarity in the 
discussion of the Applications section, we introduce the notion of adaptive submodularity and adaptive monotonicity 
in this section. 

3.3.1 Preliminaries and Notation II 

Let $V$ be the ground set. Assuming each item $x \in V$ can take a state from a set of possible states $O$, we
represent item states as a function $\phi : V \to O$ that gives the realization of the states of all items in the
ground set. Hence $\phi(x)$ is the state of $x$ under the realization $\phi$. Now consider a random realization
characterized by the random variable $\Phi$. Then we can assume a prior probability distribution over realizations,
$p(\phi) := P[\Phi = \phi]$. In cases where we observe only one state $\Phi(x)$ at a time, as we pick one item
$x \in V$ at a time, we can represent our observations so far with a partial realization $\psi$, i.e. a function from
a subset of $V$ to the observed states. Hence $\psi \subseteq V \times O$ is $\{(x, o) : \psi(x) = o\}$. We denote
the domain of $\psi$, i.e. the set of items observed in $\psi$, as $\mathrm{dom}(\psi) = \{x : \exists o, (x, o) \in \psi\}$.
When a partial realization $\psi$ agrees with $\phi$ everywhere on $\mathrm{dom}(\psi)$, they are consistent, written
$\phi \sim \psi$. This means that all the items observed with specific states in $\psi$ have the same states in
$\phi$. We extend this notion to partial realizations by saying that if $\psi$ and $\psi'$ are both consistent with
some $\phi$ and $\mathrm{dom}(\psi) \subseteq \mathrm{dom}(\psi')$, then $\psi$ is a subrealization of $\psi'$. In a
Partially Observable Markov Decision Process (POMDP) sense, partial realizations encompass the POMDP belief states.
These determine our posterior belief given the effect of all our actions and observations:

$$p(\phi \mid \psi) := P[\Phi = \phi \mid \Phi \sim \psi]$$

Definition 3 (Conditional Expected Marginal Benefit): Given a partial realization $\psi$ and an item $x$, the
conditional expected marginal benefit of $x$ conditioned on having already observed $\psi$ is denoted by
$\delta(x \mid \psi)$:

$$\delta(x \mid \psi) = E\big[f(\mathrm{dom}(\psi) \cup \{x\}, \Phi) - f(\mathrm{dom}(\psi), \Phi) \mid \Phi \sim \psi\big]$$

Definition 4 (Adaptive Monotonicity): A function $f : 2^V \times O^V \to \mathbb{R}_+$ is adaptive monotone with
respect to the distribution $p(\phi)$ if the conditional expected marginal benefit of any item $x$ is non-negative,
i.e. for all $\psi$ with $P[\Phi \sim \psi] > 0$ and all $x \in V$,

$$\delta(x \mid \psi) \ge 0$$

Definition 5 (Adaptive Submodularity): A function $f : 2^V \times O^V \to \mathbb{R}_+$ is adaptive submodular with
respect to the distribution $p(\phi)$ if the conditional expected marginal benefit of any fixed item does not increase
as more items are selected and their states are observed, i.e. $f$ is adaptive submodular w.r.t. $p(\phi)$ if for all
$\psi$ and $\psi'$ where $\psi$ is a subrealization of $\psi'$, and all $x \in V \setminus \mathrm{dom}(\psi')$, the
following condition holds:

$$\delta(x \mid \psi) \ge \delta(x \mid \psi')$$

Given these definitions we can now use the greedy algorithm defined in Algorithm 3 to give an $\alpha$-approximation
to the best greedy solution for online maximization problems of adaptive monotone submodular functions. This means
that, given the current partial realization $\psi$, we find an item $x'$ such that

$$\delta(x' \mid \psi) \ge \frac{1}{\alpha} \max_{x} \delta(x \mid \psi)$$

The budget for these maximization problems, or the number of rounds over which we would like to maximize, plays the
same role as the cardinality constraint of submodular problems.
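The adaptive greedy policy described in this section can be sketched on a toy model. The model is an assumption chosen so that the conditional expectation has a closed form: every item is a sensor that independently works (state 1) with a known prior probability or fails (state 0), and $f$ counts the targets covered by working sensors. Under these assumptions the expected marginal benefit of an unobserved sensor is its working probability times the number of targets it would newly cover. The sketch uses exact greedy selection (that is, $\alpha = 1$); all names and data are hypothetical.

```python
def expected_gain(x, psi, cover, p_work):
    """Conditional expected marginal benefit of x given partial realization psi.

    Assumes independent item states, so only x's own prior matters.
    """
    covered = set()
    for item, state in psi.items():
        if state == 1:
            covered |= cover[item]
    return p_work[x] * len(cover[x] - covered)


def adaptive_greedy(ground, budget, cover, p_work, observe):
    """Pick the item with the largest expected gain, observe it, repeat."""
    psi = {}  # partial realization: item -> observed state
    for _ in range(budget):
        candidates = [x for x in ground if x not in psi]
        if not candidates:
            break
        best = max(candidates, key=lambda x: expected_gain(x, psi, cover, p_work))
        psi[best] = observe(best)
    return psi


# Hypothetical instance.
COVER = {"s1": {1, 3, 4}, "s2": {3, 4}, "s3": {5}}
P_WORK = {"s1": 0.5, "s2": 0.9, "s3": 0.9}
GROUND = list(COVER)

# Two possible worlds: in the first s2 works, in the second s2 fails.
world_a = {"s1": 0, "s2": 1, "s3": 1}
world_b = {"s1": 1, "s2": 0, "s3": 1}

run_a = adaptive_greedy(GROUND, 2, COVER, P_WORK, world_a.__getitem__)
run_b = adaptive_greedy(GROUND, 2, COVER, P_WORK, world_b.__getitem__)
print(list(run_a), list(run_b))
```

Both runs start with the same item, but the second choice depends on what was observed, which is exactly what distinguishes an adaptive policy from the non-adaptive greedy algorithm.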

Algorithm 3 $\alpha$-Approximate Greedy Adaptive Algorithm

Input: budget $n$, ground set $V$, $p(\phi)$ and function $f$
Output: $X \subseteq V$ where $|X| = n$
Initialize: $X \leftarrow \emptyset$ and partial realization $\psi \leftarrow \emptyset$
for $i = 1$ to $n$ do
    For all $x \in V \setminus X$, evaluate $\delta(x \mid \psi) = E[f(\mathrm{dom}(\psi) \cup \{x\}, \Phi) - f(\mathrm{dom}(\psi), \Phi) \mid \Phi \sim \psi]$
    $x^* = \arg\max_{x} \delta(x \mid \psi)$
    $X \leftarrow X \cup \{x^*\}$
    Observe $\Phi(x^*)$; update $\psi$ with the observed state $\Phi(x^*)$

4 Applications

4.1 Feature Selection

In machine learning and statistics, feature selection is one of the most important concepts. The aim of this process
is to select a subset of relevant features for use in model construction and parameter fitting. In real world
problems, we often begin learning with a large number of candidate features that may be redundant, irrelevant, or
noisy. Such features needlessly increase the complexity of our models and may lead to overfitting and poor
generalization to previously unseen data. By pruning out redundant or irrelevant features, we can gain:

• improved model interpretability 

• increased computational efficiency, particularly during parameter fitting and prediction 

• enhanced generalization of the model by reducing overfitting 


In feature selection, we search among features and choose the ones that are, in a broad sense, most informative or 
useful. This definition can be interpreted as an optimization problem for choosing a subset of features which maximize 
the mutual information between features and labeling function. 


Hence, if $V$ denotes the set of all features, $S$ a binary vector indicating the chosen feature set, and $x_S$ the
real-valued vector of feature values for the features in $S$, then, assuming a budget $|S| \le b$ on the number of
chosen features, we can write the problem as:

$$\max_{S} I(y; x_S)$$

where $y$ is the labeling function.


4.1.1 Submodularity 

We will now examine when this objective is submodular. Suppose $A \subseteq B \subseteq S$ and $m \notin B$. We need
to show that

$$I(y; x_A \cup x_m) - I(y; x_A) \ge I(y; x_B \cup x_m) - I(y; x_B)$$
$$\Leftrightarrow \; H(y \mid x_A) - H(y \mid x_A, x_m) \ge H(y \mid x_B) - H(y \mid x_B, x_m) \quad (1)$$

We can write

$$\begin{aligned}
H(y \mid x_A) - H(y \mid x_A, x_m) &= H(y \mid x_A) + H(x_m \mid x_A) - H(y, x_m \mid x_A) \\
&= H(y \mid x_A) + H(x_m \mid x_A) - H(y \mid x_A) - H(x_m \mid x_A, y) \\
&= H(x_m \mid x_A) - H(x_m \mid x_A, y) \quad (2)
\end{aligned}$$

By substituting into equation (1) we can see that there are cases where this problem is not submodular. Here we give
a sufficient condition for submodularity:

• Lemma 4.1 If the $x_i$'s are all conditionally independent given $y$, then the function is submodular [9].

This condition is met in many practical machine learning problems. If the $x_i$'s are all conditionally independent
given $y$, then $H(x_m \mid x_A, y) = H(x_m \mid y)$ and equation (2) can be written as

$$H(y \mid x_A) - H(y \mid x_A, x_m) = H(x_m \mid x_A) - H(x_m \mid y)$$

and if we substitute this in equation (1), the term $H(x_m \mid y)$ cancels on both sides and it follows that

$$I(y; x_A \cup x_m) - I(y; x_A) \ge I(y; x_B \cup x_m) - I(y; x_B) \;\Leftrightarrow\; H(x_m \mid x_A) \ge H(x_m \mid x_B)$$

which holds because $A \subseteq B$ and conditioning on more variables cannot increase entropy.


Hence, the problem of feature selection can be written as a maximization of a submodular function. ED 
4.2 MAP Inference


In this section we will specifically look at the problem of maximum a posteriori (MAP) inference on graphs. To
analyze the algorithms in greater detail, we would like to introduce a few preliminary notions, including the concept
of polymatroids.


4.2.1 Polymatroids 

The notion of submodularity was first studied in the context of matroids. A set system $(V, \mathcal{F})$ is defined
by a ground set $V$ and a family of subsets $\mathcal{F} \subseteq 2^V$. Such a system is a matroid if

• $\emptyset \in \mathcal{F}$

• if $X \subseteq Y \in \mathcal{F}$ then $X \in \mathcal{F}$

• if $X, Y \in \mathcal{F}$ and $|X| > |Y|$, then there exists $e \in X \setminus Y$ such that $Y + e \in \mathcal{F}$

Now we define a function $\rho$ called a rank function, which assigns a natural number to each subset of $V$. This
rank function is analogous to the rank function of matrices; in fact, a matroid whose independent sets are the sets
of linearly independent columns of a matrix $A$ is called a matric matroid. We define our matroid rank function
$\rho$ as follows:

$$\rho(X) = \max\{|F| \;:\; F \in \mathcal{F}, \; F \subseteq X\}$$

If $\rho$ is the rank function of a matroid $(V, \mathcal{F})$ then the following properties hold:

• $\rho(X) \le |X|$ for all $X \subseteq V$

• $\rho$ is non-decreasing: if $X \subseteq Y \subseteq V$ then $\rho(X) \le \rho(Y)$

• $\rho$ is submodular

If a set function $\rho$ satisfies the above properties for a ground set $V$, then the resulting structure
$(V, \rho)$ is called a polymatroid. Similarly, if $(V, \rho)$ is a polymatroid, then the family of subsets

$$\mathcal{F} = \{F \subseteq V \mid \rho(F) = |F|\}$$

defines a matroid $(V, \mathcal{F})$.
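As a concrete instance, the three rank properties can be verified numerically for a graphic matroid, where the ground set is the edge set of a graph and a set of edges is independent when it contains no cycle. This example is an illustrative assumption (the text above does not single out the graphic matroid), and the names `graphic_rank` and `EDGES` are hypothetical.

```python
from itertools import combinations


def graphic_rank(edge_subset, num_vertices):
    """Rank of an edge set in the graphic matroid: size of a spanning forest."""
    parent = list(range(num_vertices))

    def find(u):
        # Union-find root lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    rank = 0
    for u, v in edge_subset:
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it adds no cycle
            parent[ru] = rv
            rank += 1
    return rank


# Hypothetical graph: a triangle on vertices 0, 1, 2 plus a pendant edge to 3.
EDGES = [(0, 1), (1, 2), (0, 2), (2, 3)]
NUM_VERTICES = 4


def rho(X):
    return graphic_rank(X, NUM_VERTICES)


SUBSETS = [
    frozenset(c) for r in range(len(EDGES) + 1) for c in combinations(EDGES, r)
]

bounded = all(rho(X) <= len(X) for X in SUBSETS)
monotone = all(rho(X) <= rho(Y) for X in SUBSETS for Y in SUBSETS if X <= Y)
submodular = all(
    rho(X) + rho(Y) >= rho(X | Y) + rho(X & Y) for X in SUBSETS for Y in SUBSETS
)
independent = [X for X in SUBSETS if rho(X) == len(X)]
print(bounded, monotone, submodular, len(independent))
```

The list `independent` recovers the family $\mathcal{F} = \{F \mid \rho(F) = |F|\}$, which for this matroid is the collection of forests of the graph.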


4.2.2 Cuts in Graphs, Energy Minimization and MAP Inference 

Consider a directed graph $G = (V, A, w)$ with positive edge weights $w : A \to \mathbb{R}_+$. We can define the
positive directed cut for a given set of vertices $S \subseteq V$ as the set of edges starting in $S$ and ending in
$V \setminus S$: $\delta^+(S) = \{(i, j) \in A \mid i \in S, \; j \in V \setminus S\}$. Similarly, the negative
directed cut is the set of edges starting in $V \setminus S$ and ending in $S$:
$\delta^-(S) = \{(i, j) \in A \mid i \in V \setminus S, \; j \in S\}$. Finally, the cut is
$\delta(S) = \delta^+(S) \cup \delta^-(S)$; for an undirected graph this would be the set of edges with exactly one
end in $S$. Hence we can define the weights of these cuts as

$$f^+(S) = \sum_{e \in \delta^+(S)} w(e), \qquad f^-(S) = \sum_{e \in \delta^-(S)} w(e), \qquad f(S) = \sum_{e \in \delta(S)} w(e)$$

These cut functions are submodular.

• Lemma 4.2 The cut functions $f^+$, $f^-$ and $f$ are submodular.
Proof: For the function $f^+$, suppose $X, Y \subseteq V$. Then

$$f^+(X) + f^+(Y) - f^+(X \cup Y) - f^+(X \cap Y) = \sum_{i \in X \setminus Y, \, j \in Y \setminus X} w(i, j) + \sum_{i \in X \setminus Y, \, j \in Y \setminus X} w(j, i)$$

where the sums range over the edges present in $A$. From the non-negativity of the edge weights the right-hand side
is non-negative, so $f^+$ is submodular. Submodularity of $f^-$ follows in the same way, and $f = f^+ + f^-$ is then
submodular by Lemma 2.1.
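The identity behind the proof can be checked numerically on a small weighted digraph. For the directed cut $f^+$ the submodularity gap equals the total weight of edges running between $X \setminus Y$ and $Y \setminus X$ in either direction, and for the symmetric cut $f = f^+ + f^-$ it is twice that amount. The graph in `EDGES` and all function names below are assumptions made for this illustration.

```python
from itertools import combinations

# Hypothetical weighted digraph: (tail, head) -> positive weight.
EDGES = {(0, 1): 3, (1, 2): 1, (2, 0): 2, (1, 3): 4, (3, 2): 5}
NODES = range(4)


def cut_plus(S, edges=EDGES):
    """Weight of edges leaving S."""
    return sum(w for (i, j), w in edges.items() if i in S and j not in S)


def cut_minus(S, edges=EDGES):
    """Weight of edges entering S."""
    return sum(w for (i, j), w in edges.items() if i not in S and j in S)


def cut(S, edges=EDGES):
    """Weight of edges with exactly one endpoint in S."""
    return cut_plus(S, edges) + cut_minus(S, edges)


def crossing_weight(X, Y, edges=EDGES):
    """Weight of edges between X \\ Y and Y \\ X, in either direction."""
    only_x, only_y = X - Y, Y - X
    return sum(
        w
        for (i, j), w in edges.items()
        if (i in only_x and j in only_y) or (i in only_y and j in only_x)
    )


def gap(f, X, Y):
    """Submodularity gap f(X) + f(Y) - f(X | Y) - f(X & Y)."""
    return f(X) + f(Y) - f(X | Y) - f(X & Y)


SUBSETS = [
    frozenset(c) for r in range(len(NODES) + 1) for c in combinations(NODES, r)
]

identity_holds = all(
    gap(cut_plus, X, Y) == crossing_weight(X, Y)
    and gap(cut, X, Y) == 2 * crossing_weight(X, Y)
    for X in SUBSETS
    for Y in SUBSETS
)
all_nonnegative = all(
    gap(g, X, Y) >= 0
    for g in (cut_plus, cut_minus, cut)
    for X in SUBSETS
    for Y in SUBSETS
)
print(identity_holds, all_nonnegative)
```

Integer weights are used so that the identity can be tested with exact equality.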


Now, in order to formalize our notion of maximum a posteriori estimation as a submodular function minimization
problem, we introduce the following notation. Consider a function $E : \{0, 1\}^n \to \mathbb{R}$ defined over binary
variables $x = \{x_1, \ldots, x_n\}$. We will be interested in the subclass of such functions called regular
functions [11]. We can define an equivalent set function $\hat{E}$:

$$\hat{E}(S) = E(x) \quad \text{where } x_i = 1 \text{ if and only if } i \in S$$

We define the class $\mathcal{F}^2$ to be the functions that can be written as a sum of functions of up to two binary
variables at a time:

$$E(x_1, \ldots, x_n) = \sum_{i} E^{i}(x_i) + \sum_{i < j} E^{i,j}(x_i, x_j)$$

A function in $\mathcal{F}^2$ is regular when every pairwise term satisfies
$E^{i,j}(0,0) + E^{i,j}(1,1) \le E^{i,j}(0,1) + E^{i,j}(1,0)$. The regularity of a binary function from
$\mathcal{F}^2$ translates to submodularity of the equivalent set function.
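The link between regularity and submodularity can be illustrated by brute force on three binary variables. The sketch assumes the usual pairwise regularity condition $E^{i,j}(0,0) + E^{i,j}(1,1) \le E^{i,j}(0,1) + E^{i,j}(1,0)$; the tables `UNARY`, `POTTS` and `ANTI_POTTS` and the helper names are hypothetical. The set function here is not normalized to be zero on the empty set, which does not affect the submodularity inequality.

```python
from itertools import combinations


def energy(x, unary, pairwise):
    """Energy of a binary labeling x: unary terms plus pairwise terms."""
    total = sum(unary[i][x[i]] for i in range(len(x)))
    for (i, j), table in pairwise.items():
        total += table[x[i]][x[j]]
    return total


def is_regular(pairwise):
    """Every pairwise table t satisfies t(0,0) + t(1,1) <= t(0,1) + t(1,0)."""
    return all(
        t[0][0] + t[1][1] <= t[0][1] + t[1][0] for t in pairwise.values()
    )


def set_function(S, n, unary, pairwise):
    """Equivalent set function: x_i = 1 exactly when i is in S."""
    x = tuple(1 if i in S else 0 for i in range(n))
    return energy(x, unary, pairwise)


def is_submodular(n, unary, pairwise):
    subsets = [
        frozenset(c) for r in range(n + 1) for c in combinations(range(n), r)
    ]
    E = {S: set_function(S, n, unary, pairwise) for S in subsets}
    return all(
        E[X] + E[Y] >= E[X | Y] + E[X & Y] for X in subsets for Y in subsets
    )


UNARY = [[0, 2], [1, 0], [3, 1]]      # UNARY[i][label]
POTTS = [[0, 1], [1, 0]]              # penalize disagreement (regular)
ANTI_POTTS = [[1, 0], [0, 1]]         # penalize agreement (not regular)

regular_model = {(0, 1): POTTS, (1, 2): POTTS}
irregular_model = {(0, 1): ANTI_POTTS}

print(is_regular(regular_model), is_submodular(3, UNARY, regular_model))
print(is_regular(irregular_model), is_submodular(3, UNARY, irregular_model))
```

The unary terms form a modular function, so they never affect the outcome of the check; only the pairwise tables decide whether the energy is submodular.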

Given an input set of nodes $V$ in a graph $\mathcal{G}$ and a set of labels $\mathcal{L}$, the labeling $l$ (which is
a mapping from $V$ to $\mathcal{L}$) can be deduced by minimizing some energy function. In graph-based energy
minimization problems in computer vision and machine learning, the standard form of the energy function is

$$E(l) = \sum_{p \in V} D_p(l_p) + \sum_{(p, q) \in \mathcal{N}} V_{p,q}(l_p, l_q)$$

where $\mathcal{N} \subseteq V \times V$ is a neighbourhood set of nodes, $D_p$ is the cost of assigning label $l_p$
to node $p$, and $V_{p,q}$ is the cost of assigning labels $l_p, l_q$ to adjacent nodes $p, q$. If $V_{p,q}$ is a
non-convex function of $|l_p - l_q|$, which accounts for border labeling, the energy function $E(l)$ is called a
discontinuity preserving energy function. This label assignment problem is similar to the graph cut problem, as the
labeling function is submodular in the context of the regular functions defined earlier. Minimizing this energy
function $E$ is equivalent to finding the minimum cut of the graph $\mathcal{G}$. However, one should note that the
solution depends on the exact form of the function $V_{p,q}$, which should not be convex, as a convex choice leads to
oversmoothing of borders. In the case when $V_{p,q}(l_p, l_q) = T[l_p \ne l_q]$, where $T[\cdot]$ is the indicator
function, this smoothness term is called the Potts model. The solution shown above can readily be extended to more
than two labels, i.e. beyond the binary problem. We use the binary problem to motivate the result shown above. This
result is widely used in computer vision in the domains of image segmentation, stereo correspondence and
multi-camera image reconstruction. Another widely used application of this approach is to find the maximum a
posteriori estimate of a Markov random field [12].

Consider a set of random variables $X = \{X_1, \ldots, X_n\}$ defined on a set $S$ such that the variable $X_i$ can
take a value $x_i$ from the label set $\mathcal{L}$. Then $X$ is a Markov random field with respect to the
neighbourhood set $N = \{N_i \mid i \in S\}$ iff the positivity property $P(x) > 0$ and the Markovian property
$P(x_i \mid x_{S \setminus i}) = P(x_i \mid x_{N_i})$ hold for all $i \in S$. Here $P(x) = P(X = x)$,
$P(x_i) = P(X_i = x_i)$, and finally $P(X_1 = x_1, \ldots, X_n = x_n) = P(X = x)$, where $x = \{x_i \mid i \in S\}$ is a
realization of the field.

Given these definitions the MAP estimate of the MRF can be formulated as an energy minimization problem, where 
energy corresponding to a realization of x (configuration of the field) is given by the negative log likelihood of the 
joint posterior probability of the MRF 

$$\phi(x) = -\log P(x \mid D)$$

Hence the corresponding energy function for the Potts model becomes

$$E(x) = \sum_{i \in S} \Big( \phi(D \mid x_i) + \sum_{j \in N_i} \psi(x_i, x_j) \Big)$$

where

$$\phi(D \mid x_i) = -\log P(i \in S)$$

and

$$\psi(x_i, x_j) = \begin{cases} K_{ij} & \text{if } x_i \ne x_j \\ 0 & \text{if } x_i = x_j \end{cases}$$

Here $K_{ij}$ is some penalty cost which makes $\psi(x_i, x_j)$ non-convex.


Finally we can conclude that the energy minimization problem solved by min-cut/max-flow, which yields the minimum
energy solution, is equivalent to finding the maximum a posteriori solution of a Markov random field.


4.3 Active Learning 

4.3.1 Supervised learning theory 

In classic supervised machine learning, the learning algorithm (or learner) is given the task of finding a response
function $f : \mathcal{X} \to \mathcal{Y}$ that predicts as accurately as possible the output response
$Y \in \mathcal{Y}$ for a given input observation $X \in \mathcal{X}$ [13]. Responses take a variety of forms. In
classification, the response may be a label from a discrete set of choices $\mathcal{Y} = \{1, 2, \ldots\}$, while in
regression it may be continuous. One of the most common tasks is binary classification, in which
$\mathcal{Y} = \{\pm 1\}$. We have some unknown underlying distribution $\mathcal{D}$ over the space of observations
and responses $\mathcal{X} \times \mathcal{Y}$, so that observation-response pairs are sampled according to
$(X, Y) \sim \mathcal{D}$. The learner chooses from candidate functions or hypotheses in a hypothesis space
$\mathcal{H}$ with the goal of minimizing the expected error or risk
$\varepsilon_{\mathcal{D}}(h) = E_{(X,Y) \sim \mathcal{D}}[\mathrm{err}(h(X), Y)]$. In other words, the learner's goal
is to find $h^*$ that minimizes the risk: $h^* = \arg\min_{h \in \mathcal{H}} \varepsilon_{\mathcal{D}}(h)$. For
standard classification tasks, the error function is simply the indicator function of a mistake
$\mathbb{1}\{h(X) \ne Y\}$, and so the risk is simply the probability of a mistake
$\varepsilon(h) = E_{(X,Y) \sim \mathcal{D}}[\mathbb{1}\{h(X) \ne Y\}] = \Pr\{h(X) \ne Y\}$. For continuous response
functions and multiclass classification where order matters, there is a wide choice of more complex error functions.

Of course, in practice $\mathcal{D}$ is unknown and so it is impossible to directly minimize the risk. Instead, the
learner is provided with “supervision” in the form of a finite sample of observation-response pairs, i.e., a labeled
training data set $S = \{(X_i, Y_i)\}_{i=1,\ldots,n}$ where $|S| = n$. The learner can then approximate $\mathcal{D}$
using $S$ and minimize the empirical error over $S$:

$$\varepsilon_S(h) = E_{(X,Y) \in S}[\mathrm{err}(h(X), Y)] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{err}(h(X_i), Y_i)$$

Note that this definition of empirical risk assumes that the samples $(X, Y)$ are independent and identically
distributed (IID), a fairly common assumption in supervised machine learning. In the empirical risk minimization
(ERM) paradigm, the learner assumes that the sample $S$ is sufficiently representative of $\mathcal{D}$ such that
choosing $\hat{h} = \arg\min_{h \in \mathcal{H}} \varepsilon_S(h)$ will yield a hypothesis $\hat{h}$ that will also
have a relatively low risk $\varepsilon_{\mathcal{D}}(\hat{h})$ [14]. A well known theoretical result for
classification that comes from Vapnik tells us that if we want to learn a “good” classifier from a hypothesis class
$\mathcal{H}$, then we need roughly $|S| = O\big(\frac{d}{\epsilon^2} \log(1/\delta)\big)$ points in our training
sample m. Here $\epsilon$ is the maximum deviation that we will tolerate between the true risks of $\hat{h}$ and the
optimal $h^*$, and $\delta$ is the probability with which we are willing to let this be violated (i.e., we want
$|\varepsilon_{\mathcal{D}}(\hat{h}) - \varepsilon_{\mathcal{D}}(h^*)| \le \epsilon$ to hold with probability
$1 - \delta$). Informally, $d$ represents the “size” of our hypothesis class; formally, it is the VC dimension. A
useful rule of thumb is that for most useful hypothesis classes, the VC dimension scales linearly with the number of
parameters and so the number of training samples needed scales linearly with the “complexity” of the model.

It is important to distinguish two cases of supervised learning, based on realizability. When the problem is realizable, 
then there exists some hypothesis $h \in \mathcal{H}$ that can perfectly predict the response for every point (i.e.,
$\mathrm{err}(h) = 0$); in binary classification, this corresponds to the problem being “separable” by a hypothesis
in $\mathcal{H}$. When $\mathrm{err}(h) > 0$ for every $h \in \mathcal{H}$,
the problem is not realizable ED- The presence of label noise , where the same point may receive different responses, 
further complicates this picture. If a training sample S contains noisy labels (perhaps due to error), this may mislead 
the ERM. If the true data distribution allows points to have different labels (i.e., our true labeling function is stochastic), 
then at best we may only be able to model $P(Y \mid X)$, rather than make perfect predictions.


4.3.2 Selective sampling as a submodular problem 

Imagine the following problem, which we will call the selective sampling on a budget problem: given a large, fully
labeled finite sample $S$, we will “purchase” a subset $C \subseteq S$ (where $|C| \ll |S|$ because our “cost” scales
with $|C|$) and train $\hat{h} = \arg\min_{h \in \mathcal{H}} \varepsilon_C(h)$ with the goal of minimizing
$\varepsilon_{\mathcal{D}}(\hat{h})$. We are given full access to $S$ until we make our purchase decision, at which
point we can use only $C$ to choose our final hypothesis (i.e., we must “forget” everything we know about
$S \setminus C$). This can be thought of as choosing the smallest possible representative subsample $C$. Intuitively,
it is similar to a set cover problem: we want to pose queries that “cover” (i.e., eliminate) as many false hypotheses
(inconsistent with our labeled data set) as possible. This problem has submodular structure.


Lemma 4.3 The objective of the selective sampling on a budget problem is submodular and monotone increasing.


Proof 4.4 We provide a non-rigorous justification. First, for a labeled subsample $A$, define a hypothesis set
$\mathcal{H}_A \subseteq \mathcal{H}$ that contains all hypotheses from $\mathcal{H}$ that are consistent with the
labeled points in $A$: $\mathcal{H}_A = \{h : \varepsilon_A(h) = 0 \wedge h \in \mathcal{H}\}$. Now define a function
$f(A) = 1 - |\mathcal{H}_A| / |\mathcal{H}|$, i.e., $f$ maps $A$ to the value of 1 minus the probability mass (under a
uniform prior) of its consistent hypothesis set. Now consider labeled subsamples $B$ and $B'$ such that
$B \subseteq B' \subseteq S$ and an arbitrary point $X \notin B, B'$. The key insight here is that as we add labeled
points to our subsamples, we can only remove hypotheses from our current hypothesis space; once a hypothesis has been
removed, it cannot be re-added.
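$f$ is monotone increasing: Suppose that hypothesis $h \in \mathcal{H}_{B'}$. This means that it is consistent with
every labeled point in $B'$. Because $B \subseteq B'$, $h$ must also be consistent with every point in $B$ and so
$h \in \mathcal{H}_B$. Therefore, $\mathcal{H}_{B'} \subseteq \mathcal{H}_B$ and so

$$\begin{aligned}
|\mathcal{H}_{B'}| &\le |\mathcal{H}_B| \\
|\mathcal{H}_{B'}| / |\mathcal{H}| &\le |\mathcal{H}_B| / |\mathcal{H}| \\
1 - |\mathcal{H}_{B'}| / |\mathcal{H}| &\ge 1 - |\mathcal{H}_B| / |\mathcal{H}| \\
f(B') &\ge f(B)
\end{aligned}$$

whenever $B \subseteq B'$.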

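$f$ is submodular: Now suppose that adding point $X$ to $B'$ removes $h$ from $\mathcal{H}_{B'}$, i.e.,
$h \in \mathcal{H}_{B'} \setminus \mathcal{H}_{B' \cup \{X\}}$. Because $\mathcal{H}_{B'} \subseteq \mathcal{H}_B$
whenever $B \subseteq B'$, it must also be the case that
$\mathcal{H}_{B' \cup \{X\}} \subseteq \mathcal{H}_{B \cup \{X\}}$ and so
$h \in \mathcal{H}_B \setminus \mathcal{H}_{B \cup \{X\}}$. Thus, if adding $X$ to $B'$ removes $m$ hypotheses from
$\mathcal{H}_{B'}$, then adding $X$ to $B$ must remove $n \ge m$ hypotheses from $\mathcal{H}_B$:

$$\begin{aligned}
m &\le n \\
|\mathcal{H}_{B'}| - |\mathcal{H}_{B'}| + m &\le |\mathcal{H}_B| - |\mathcal{H}_B| + n \\
|\mathcal{H}_{B'}| / |\mathcal{H}| - (|\mathcal{H}_{B'}| - m) / |\mathcal{H}| &\le |\mathcal{H}_B| / |\mathcal{H}| - (|\mathcal{H}_B| - n) / |\mathcal{H}| \\
-1 + |\mathcal{H}_{B'}| / |\mathcal{H}| + 1 - (|\mathcal{H}_{B'}| - m) / |\mathcal{H}| &\le -1 + |\mathcal{H}_B| / |\mathcal{H}| + 1 - (|\mathcal{H}_B| - n) / |\mathcal{H}| \\
1 - (|\mathcal{H}_{B'}| - m) / |\mathcal{H}| - (1 - |\mathcal{H}_{B'}| / |\mathcal{H}|) &\le 1 - (|\mathcal{H}_B| - n) / |\mathcal{H}| - (1 - |\mathcal{H}_B| / |\mathcal{H}|) \\
f(B' \cup \{X\}) - f(B') &\le f(B \cup \{X\}) - f(B)
\end{aligned}$$

whenever $B \subseteq B'$.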

This is intuitive. If $\mathcal{H}' \subseteq \mathcal{H}$ contains all hypotheses in $\mathcal{H}$ that are
inconsistent with $X$'s label, then clearly $\mathcal{H}_{B'} \subseteq \mathcal{H}_B$ implies that
$\mathcal{H}_{B'} \cap \mathcal{H}' \subseteq \mathcal{H}_B \cap \mathcal{H}'$.

This result may not seem terribly exciting, but what it does suggest is that we can solve the selective sampling on
a budget problem using a greedy approach: on the $t$-th iteration, choose the labeled point $(X, Y)$ that eliminates
the largest number of inconsistent hypotheses from our current $\mathcal{H}_{B_t}$:

$$(X, Y)_t = \arg\max_{(X, Y) \in S \setminus B_t} \sum_{h \in \mathcal{H}_{B_t}} \mathbb{1}\{h(X) \ne Y\}$$
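The greedy rule can be sketched for a finite hypothesis class. The example assumes one-dimensional threshold classifiers $h_\theta(x) = +1$ if $x \ge \theta$ and $-1$ otherwise, with $\theta$ drawn from a small grid, and a noise-free labeled sample; the class, the data, and the function names are hypothetical choices made for illustration only.

```python
def predict(theta, x):
    """Threshold classifier: +1 at or above theta, -1 below."""
    return 1 if x >= theta else -1


def greedy_select(sample, hypotheses, budget):
    """Greedily pick labeled points that eliminate the most hypotheses.

    Returns the chosen points and the remaining version space.
    """
    version_space = list(hypotheses)
    remaining = list(sample)
    chosen = []
    for _ in range(budget):
        if not remaining:
            break

        def eliminated(point):
            x, y = point
            return sum(1 for theta in version_space if predict(theta, x) != y)

        best = max(remaining, key=eliminated)
        if eliminated(best) == 0:
            break  # no remaining point can shrink the version space further
        chosen.append(best)
        remaining.remove(best)
        version_space = [
            theta for theta in version_space if predict(theta, best[0]) == best[1]
        ]
    return chosen, version_space


# Hypothetical data: labels generated by the threshold 2.5.
THRESHOLDS = [0.5, 1.5, 2.5, 3.5, 4.5]
SAMPLE = [(x, predict(2.5, x)) for x in range(6)]

chosen, version_space = greedy_select(SAMPLE, THRESHOLDS, budget=3)
value = 1 - len(version_space) / len(THRESHOLDS)  # f(B) from the proof above
print(chosen, version_space, value)
```

Only two of the six labeled points are purchased here: the two points adjacent to the decision boundary already rule out every hypothesis except the one that generated the labels.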

4.3.3 Greedy active learning is adaptive submodular 

Now imagine a variation of the above selective sampling problem where we do not have access to the label of
$X \in S$ until we “purchase” it. Here we might use active learning. Active learning is a variation of the supervised
learning paradigm where the learner does not receive access to a fully labeled data sample $S$ upfront. Rather, it has
access to an unlabeled data sample $U = \{(X, ?)\}$, as well as an oracle that the learner can query for the response
(or label) of an observation, $Y = \mathrm{or}(X)$ [16]. The active learner is given agency to choose which individual
samples to label, but each query has a cost $c$ and the learner has only a limited budget to spend on labeling data.
Similar to the selective sampling scenario described above, the active learner has dual goals: to choose
simultaneously a labeled subset of observations $C \subseteq U$ and a hypothesis
$\hat{h} = \arg\min_{h \in \mathcal{H}} \varepsilon_C(h)$ (i.e., $\hat{h}$ is the ERM for $C$) that will yield the best
possible predictive performance (i.e., lowest risk $\varepsilon_{\mathcal{D}}(\hat{h})$).

When evaluating active learning algorithms, we are concerned primarily with two performance properties: the quality (in terms of risk) of the hypotheses they produce and their query efficiency. Intuitively, a good active learner will use a very small number of label queries to produce a hypothesis with very small predictive error. More formally, we are interested in (1) how the error of the hypothesis produced by an active learner that chooses labeled subsample $\mathcal{L}$ compares with that of the hypothesis that we could learn from a fully labeled sample $S$, where $\mathcal{L} \subseteq S$; and (2) how many label queries must be made to achieve a certain level of performance, which we call label complexity and express in Big-Oh notation. An ideal active learner will compete with fully supervised learning with $|\mathcal{L}| \ll |S|$. More realistically, we hope to at least place an upper bound on the error of active learning that is within a constant (multiplicative or additive) factor of the error of fully labeled supervised learning.

It is not hard to design greedy approaches to active learning, but two questions arise: first, are there greedy active learning algorithms that have sound theoretical guarantees about error and label complexity; and second, can we show that such algorithms are in fact specific cases of more general approaches based on submodularity? The answer to both of these questions is, in fact, yes. Let $\mathcal{L}_t$ be the set of labeled data points after $t$ queries (and recall that $\mathcal{H}_{\mathcal{L}_t}$ is the set of hypotheses from $\mathcal{H}$ consistent with the labeled data in $\mathcal{L}_t$). For the next, $(t+1)$th, label query, we want to choose the unlabeled point that provokes the greatest disagreement between hypotheses in $\mathcal{H}_{\mathcal{L}_t}$. The maximum disagreement occurs when half of the hypotheses predict one label and the rest the other. Equivalently, such a point minimizes the absolute value of the sum of all predicted labels, $|\sum_{h \in \mathcal{H}_{\mathcal{L}_t}} h(x)|$ (when using $\pm 1$ labels). Following this query policy, we hope to cut the $t$th hypothesis space roughly in half with query $t + 1$ and achieve a label complexity that is roughly $O(\log(d/\epsilon))$ for $d$ the size (e.g., VC dimension) of the hypothesis space and deviation bound $\epsilon$. [17] shows that in the worst case, this strategy may have to query every single label; indeed, for certain pathological cases, even the optimal query strategy will need to query every label. However, the average case analysis is much more promising. On average, the greedy strategy’s label complexity is at most $O(\log d)$ times larger than that of the optimal policy, as we show below in Theorem 4.5, which rephrases Claim 4 and Theorem 3 from [17]:


Theorem 4.5 (Dasgupta [17]) Suppose the optimal query policy requires $M$ labels in expectation for target hypotheses chosen uniformly from hypothesis class $\mathcal{H}$ of (VC) dimension $d > e^e \approx 16$. Then the expected number of labels queried by the greedy strategy is at least and at most $4M \log d$.

As [17] points out, the lower bound is a bit depressing, but we derive some comfort from the fact that the upper bound matches the lower bound within a multiplicative factor. We can extend this analysis to a Bayesian framework where we have a nonuniform prior distribution over hypotheses $\pi(h)$. In this setting, we seek a label query that will divide the probability mass over hypotheses (rather than the hypothesis space itself) in half. We do this by minimizing the absolute value of the sum of predictions weighted by the prior probabilities of the hypotheses making them: $|\sum_{h \in \mathcal{H}_{\mathcal{L}_t}} \pi(h) h(x)|$. In this case, the $d$ term in the lower and upper bounds is replaced with $1 / \min_{h \in \mathcal{H}} \pi(h)$.
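The prior-weighted greedy query rule can be sketched as follows. This is a minimal illustration, assuming a finite hypothesis list, ±1-valued hypotheses given as callables, a strictly positive prior stored in a dict keyed by hypothesis, and a noiseless oracle; the function names and the `make_threshold` helper are hypothetical.

```python
def make_threshold(t):
    """Hypothetical 1-D threshold hypothesis: +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1


def greedy_query(version_space, prior, pool):
    """Return the pool point whose prior-weighted +/-1 votes are most balanced."""
    def imbalance(x):
        return abs(sum(prior[h] * h(x) for h in version_space))
    return min(pool, key=imbalance)


def greedy_active_learner(hypotheses, prior, unlabeled, oracle):
    """Query labels greedily until one hypothesis remains or no query helps."""
    version_space = list(hypotheses)
    pool = list(unlabeled)
    queries = []
    while len(version_space) > 1 and pool:
        x = greedy_query(version_space, prior, pool)
        if len({h(x) for h in version_space}) == 1:
            break  # even the most balanced point causes no disagreement
        y = oracle(x)
        pool.remove(x)
        queries.append((x, y))
        # Keep only hypotheses consistent with the newly revealed label.
        version_space = [h for h in version_space if h(x) == y]
    return queries, version_space
```

For eight threshold hypotheses under a uniform prior, this rule behaves like binary search and identifies the target with three queries.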

In [8], the authors show that the hypothesis space reduction problem is adaptive submodular, specifically an example of an adaptive stochastic coverage problem. Here our ground set is the set of all points $V = \{x : x \in \mathcal{U}\}$, and each point has an unobserved state $O = \{y : y \in \mathcal{Y}\} = \{\pm 1\}$, where $\Phi(x) = y$ for the pair $(x, y)$ and, for a given hypothesis $h \in \mathcal{H}$, $\Phi_h(x) = h(x)$. The set of labeled points $\mathcal{L}_t$ forms a consistent partial realization $\Psi_t$ at iteration $t$. Then we can define a function that takes as input an element subset $V'$ and a realization function $\Phi'$ and maps it to a real number in the interval $[0, 1]$:

\[
f(V', \Phi') = f(H = \{h : h(x) = \Phi'(x) \text{ for all } x \in V'\}) = 1 - \sum_{h \in H} \pi(h)
\]

So for a labeled subset $\mathcal{L}_t$, $f(V_t, \Phi_t) = f(\mathcal{L}_t)$, where $V_t = \{x : (x, y) \in \mathcal{L}_t\}$ and $\Phi_t(x) = \Psi_t(x) = y$ for $(x, y) \in \mathcal{L}_t$. This function is adaptively submodular, as shown in Lemma 4.6, adapted from [8]:


Lemma 4.6 (Golovin and Krause [8]) The hypothesis space reduction problem is adaptive submodular and adaptive monotone.

The proof of monotonicity follows along lines similar to the one used in Theorem 4.3: basically, querying a label can only remove hypotheses, and hypothesis probabilities are nonzero, so removing one can only reduce the remaining probability mass and hence cannot decrease the value of $f$. The proof of submodularity is more subtle, though it rests on the same intuition as that of monotonicity and involves comparing the conditional expected marginal benefits, as described in [8]. Interestingly, this angle yields a slightly more optimistic average case analysis than that given in Theorem 4.5 above, removing the constant multiplier from the upper bound. We give it below in Theorem 4.7, adapted from [8]:


Theorem 4.7 (Golovin and Krause [8]) Suppose the optimal query policy requires $M$ labels in expectation for target hypotheses chosen using distribution $\pi$ from hypothesis class $\mathcal{H}$. Then the expected number of labels queried by the greedy strategy is at most $M \left( \log \frac{1}{\min_{h \in \mathcal{H}} \pi(h)} + 1 \right)$.

4.3.4 New directions 

This is a wonderful example of cross fertilization between research in computer science theory and optimization and learning theory. Working on submodularity and adaptive submodularity, computer scientists were able to rediscover and generalize previously published results from machine learning, improving an upper bound along the way. More important, they provided new and useful insights into the problem, relating it to other problems (which we did not discuss in this section) and paving the way to new discoveries. Recently, there has been an explosion of similar work, much of it published in 2013. [18] describe a framework for performing distributed submodular maximization in a shared-nothing (MapReduce) storage setting and using it to choose a representative subsample of a massive data set for learning (similar to our selective sampling on a budget problem). [19] describe a greedy batch-mode active learning algorithm that queries labels in batches of size $k > 1$ and show that this approach is competitive not only with optimal batch-mode active learning but also with more traditional greedy active learning. There are a variety of other papers pushing the boundary in this area [20, 21, 22, 23].


There are two lines of work that seem conspicuously absent (at least, based on our admittedly myopic literature review): (1) applications of submodularity to streaming active learning; and (2) “aggressive” active learning in the nonrealizable case. The former involves active learning when we do not have access to the entirety of $\mathcal{U}$ at the start of the learning process. Rather, we receive one data point at a time in an online fashion and must make a query decision for point $X_t$ based only on the samples $\mathcal{U}_t$ that we’ve seen so far. The above greedy algorithm and analysis require that we be able to choose a point $X_t = \arg\min_{X \in \mathcal{U}} |\sum_{h \in \mathcal{H}_{\mathcal{L}_t}} \pi(h) h(X)|$. We cannot, of course, do this in the streaming setting. Nonetheless, intuition suggests that we may be able to extend the adaptive submodularity framework (or some related idea) to this setting.


Aggressive active learning in the non-realizable case is a wide open problem, at least as of this writing. Informally, aggressive
active learners, which include greedy active learners, are those that attempt to make the “most informative” label 
query at each step. In the realizable case, it is possible to develop aggressive algorithms that are statistically consistent 
(will discover the optimal hypothesis with enough queries) and have sound theoretical guarantees for label complexity. 
However, these guarantees go out the window without realizability. Perhaps some form of submodularity may help here,
but at first blush, it looks as though the nonrealizable case will not satisfy the assumptions necessary for adaptive 
submodularity. We do have a variety of mellow active learners, which seek any informative label query, that are label
efficient and statistically consistent. It would be interesting to develop a new analysis of these algorithms
in terms of submodularity and then to see if this analysis perhaps provides a bridge between mellow and aggressive 
active learning. 


5 Submodularity in Weighted Constraint Reasoning 

Many applications require efficient representation and reasoning about factors like fuzziness, probabilities, preferences, and/or costs. Various extensions to the basic framework of Constraint Satisfaction Problems (CSPs) [25] have been introduced to incorporate and reason about such “soft” constraints. These include variants like fuzzy-CSPs, probabilistic-CSPs, and Weighted-CSPs (WCSPs). A WCSP is an optimization version of a CSP in which the constraints are no longer “hard,” but are extended by associating (non-negative) costs with the tuples. The goal is to find an assignment of values to all variables from their respective domains such that the total cost is minimized.

For simplicity, we restrict ourselves to Boolean WCSPs. Note that this class can be used to model important combinatorial problems such as representing and reasoning about user preferences [26], over-subscription planning with goal preferences [27], combinatorial auctions [28], bioinformatics [29], energy minimization problems in probabilistic settings, computer vision, Markov Random Fields [30], etc. In addition, many real-world domains exhibit submodularity in the cost structure that is worth exploiting for computational benefits. In what follows, we define a class of submodular constraints over Boolean domains and give a polynomial-time algorithm for solving instances from this class.


5.1 Weighted Constraint Satisfaction Problems 

Formally, a WCSP is defined by a triplet $(\mathcal{X}, \mathcal{D}, \mathcal{C})$ where $\mathcal{X} = \{X_1, X_2 \ldots X_N\}$ is a set of variables, and $\mathcal{C} = \{C_1, C_2 \ldots C_M\}$ is a set of weighted constraints on subsets of the variables. Each variable $X_i$ is associated with a discrete-valued domain $D_i \in \mathcal{D}$, and each constraint $C_i$ is defined on a certain subset $S_i \subseteq \mathcal{X}$ of the variables. $S_i$ is referred to as the scope of $C_i$; and $C_i$ specifies a non-negative cost for every possible combination of values to the variables in $S_i$. The arity of the constraint $C_i$ is equal to $|S_i|$. An optimal solution is an assignment of values to all variables (from their respective domains) so that the sum of the costs (as specified locally by each weighted constraint) is minimized. In a Boolean WCSP, the size of any variable’s domain is 2 (that is, $D_i = \{0, 1\}$ for all $i$). Boolean WCSPs are representationally as powerful as WCSPs; and it is well known that optimally solving Boolean WCSPs is NP-hard in general [25]. The constraint graph associated with a WCSP instance is an undirected graph where a node represents a variable and an edge $(X_i, X_j)$ exists if and only if $X_i$ and $X_j$ appear together in some constraint.

Figure 1: The table on the right-hand side represents the projection of the minimum weighted VC problem onto the IS $\{X_1, X_4\}$ of the node-weighted undirected graph on the left-hand side. (The weights on $X_4$ and $X_7$ are set to 3 and 2, respectively, while all other nodes have unit weights.) The entry ‘7’ in the cell $(X_1 = 0, X_4 = 1)$, for example, indicates that, when $X_1$ is prohibited from being in the minimum weighted VC but $X_4$ is necessarily included in it, then the weight of the minimum weighted VC, $\{X_2, X_3, X_4, X_7\}$ or $\{X_2, X_3, X_4, X_5, X_6\}$, is 7.
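To make the WCSP definition of Section 5.1 concrete, the following sketch represents a Boolean WCSP and solves it by exhaustive enumeration. It is an illustration only: the representation (a constraint as a pair of a scope tuple and a cost table) and the function names are assumptions made here, and brute force is exponential in the number of variables, so it is useful only for tiny instances.

```python
from itertools import product


def solve_boolean_wcsp(num_vars, constraints):
    """Return (minimum total cost, optimal assignment) by brute force.

    `constraints` is a list of (scope, cost) pairs, where `scope` is a tuple
    of variable indices and `cost` maps each 0/1 tuple over the scope to a
    non-negative cost.
    """
    best_cost, best_assignment = None, None
    for assignment in product((0, 1), repeat=num_vars):
        total = sum(
            cost[tuple(assignment[i] for i in scope)]
            for scope, cost in constraints
        )
        if best_cost is None or total < best_cost:
            best_cost, best_assignment = total, assignment
    return best_cost, best_assignment


def constraint_graph(constraints):
    """Edges (i, j), i < j, for variables that share some constraint."""
    edges = set()
    for scope, _ in constraints:
        for a in scope:
            for b in scope:
                if a < b:
                    edges.add((a, b))
    return edges
```

For example, two binary constraints on scopes `(0, 1)` and `(1, 2)` plus a unary constraint on `(0,)` give a constraint graph that is a path on three variables.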


5.2 Submodular Constraints 

Submodular constraints over Boolean domains correspond directly to submodular set functions. A set function $\psi : 2^V \to \mathbb{Q}$ defined on all subsets of a set $V$ is submodular if and only if, for all subsets $S, T \subseteq V$, we have $\psi(S \cup T) + \psi(S \cap T) \leq \psi(S) + \psi(T)$. A submodular constraint is a weighted constraint with a submodular cost function. Here, the correspondence is in light of the observation that any subset $S$ can be interpreted as specifying the Boolean variables in $V$ that are set to 1. Boolean WCSPs with submodular constraints are known to be tractable. However, the general algorithm for solving Boolean WCSPs with submodular constraints has a time complexity of $O(N^6)$, which is not very practical. Specific classes of submodular constraints have been shown to be related to graph cuts, and are therefore solvable more efficiently.
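The definition can be checked directly on small constraints. The sketch below tests the inequality $\psi(S \cup T) + \psi(S \cap T) \leq \psi(S) + \psi(T)$ for every pair of Boolean assignments, reading each 0/1 tuple as the set of variables assigned 1. The function name `is_submodular` and the cost-table representation are assumptions for illustration.

```python
from itertools import product


def is_submodular(cost, arity):
    """Check submodularity of a Boolean weighted constraint.

    `cost` maps every 0/1 tuple of length `arity` to a number. Union and
    intersection of the underlying sets are bitwise OR and AND.
    """
    points = list(product((0, 1), repeat=arity))
    for s in points:
        for t in points:
            union = tuple(a | b for a, b in zip(s, t))
            inter = tuple(a & b for a, b in zip(s, t))
            if cost[union] + cost[inter] > cost[s] + cost[t]:
                return False
    return True
```

A constraint that charges 1 when two variables differ (a cut-like cost) passes the check, while one that charges 1 when they are equal does not.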

5.3 Lifted Graphical Representations for Weighted Constraints 

Constraint Composite Graphs (CCGs) are combinatorial structures associated with optimization problems posed as 
WCSPs. They provide a unifying framework for exploiting both the graphical structure of the variable interactions as 
well as the numerical structure of the weighted constraints [32]. We reformulate WCSPs as minimum weighted vertex
cover problems to construct simple bipartite graph representations for important classes of submodular constraints, 
thereby translating them into max-flow problems on bipartite graphs. 

The concept of the minimum weighted VC¹ on a given undirected graph $G = (V, E)$ can be extended to the notion of projecting minimum weighted VCs onto a given IS² $U \subseteq V$. The input to such a projection is the graph $G$ as well as an identified IS $U = \{u_1, u_2 \ldots u_k\}$. The output is a table of $2^k$ numbers. Each entry in this table corresponds to a $k$-bit vector. We say that a $k$-bit vector imposes the following restrictions: if the $i$th bit is 0 (1), the node $u_i$ is necessarily excluded (included) from the minimum weighted VC. The value of an entry is the weight of the minimum weighted VC conditioned on the restrictions imposed by it. Figure 1 presents a simple example.
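The projection can be computed by brute force on small graphs, as in the sketch below. The graph used in the usage note is a hypothetical three-node path, not the graph of Figure 1, and the function names are assumptions introduced here.

```python
from itertools import product


def min_weight_vc(nodes, edges, weight, forced):
    """Minimum weight of a vertex cover that respects `forced`.

    `forced` maps some nodes to 0 (must be excluded) or 1 (must be included).
    Returns None if no vertex cover satisfies the restrictions.
    """
    free = [v for v in nodes if v not in forced]
    best = None
    for bits in product((0, 1), repeat=len(free)):
        chosen = dict(forced)
        chosen.update(zip(free, bits))
        if all(chosen[a] or chosen[b] for a, b in edges):
            total = sum(weight[v] for v in nodes if chosen[v])
            if best is None or total < best:
                best = total
    return best


def project_onto_is(nodes, edges, weight, independent_set):
    """Table of 2^k entries: one minimum VC weight per k-bit restriction."""
    table = {}
    for bits in product((0, 1), repeat=len(independent_set)):
        forced = dict(zip(independent_set, bits))
        table[bits] = min_weight_vc(nodes, edges, weight, forced)
    return table
```

On the path `u1 - a - u2` with weights 1, 3, 1 and IS `[u1, u2]`, excluding both endpoints forces `a` into the cover (weight 3), whereas including both lets `a` stay out (weight 2).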

The aforementioned table can be viewed as a weighted constraint over $|U|$ Boolean variables. Conversely, given a (Boolean) weighted constraint, we can think about designing a “lifted” representation for it so as to be able to view it as the projection of a minimum weighted VC problem in some node-weighted undirected graph. This idea was first discussed in [33]. The benefit of constructing these graphical representations for individual constraints lies in the fact that the “lifted” graphical representation for the entire WCSP can be obtained simply by “merging”³ them. This “merged” graph is referred to as the CCG associated with the WCSP. Computing the minimum weighted VC for the CCG yields a solution for the WCSP; namely, if $X_i$ is in the minimum weighted VC, then it is assigned the value 1 in the WCSP, else it is assigned the value 0 in the WCSP. Figure 2 shows an example WCSP and its CCG.

Any given weighted constraint on Boolean variables can be represented graphically using a tripartite graph, which can be constructed in polynomial time [32]. In many cases, the lifted graphical representations even turn out to be only bipartite. Since the resulting CCG is also bipartite if each of the individual graphical representations is bipartite, the tractability of the language $\mathcal{L}^{\text{Boolean}}_{\text{bipartite}}$, the language of all Boolean weighted constraints with a bipartite graphical representation, is readily established. This is because solving minimum weighted VC problems on bipartite graphs is reducible to max-flow problems, and can therefore be solved efficiently in polynomial time.
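The reduction from minimum weighted VC on a bipartite graph to max-flow can be sketched as follows: connect a source to each left node with capacity equal to its weight, connect each right node to a sink likewise, and give the original edges infinite capacity. A minimum cut then corresponds to a minimum weighted VC. The code below is a self-contained illustration using a plain Edmonds-Karp max-flow; the function names and the `("L", u)` / `("R", v)` node tagging are assumptions made here, and a production implementation would use a faster flow algorithm.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp. Returns (flow value, nodes reachable from source at the end)."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)  # reverse edges
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow, set(parent)
        # Find the bottleneck along the augmenting path, then push flow.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


def min_weight_vc_bipartite(left_weight, right_weight, edges):
    """Return (weight, left nodes in cover, right nodes in cover)."""
    capacity = {"s": {}, "t": {}}
    for u, w in left_weight.items():
        capacity["s"][("L", u)] = w
        capacity[("L", u)] = {}
    for v, w in right_weight.items():
        capacity[("R", v)] = {"t": w}
    for u, v in edges:
        capacity[("L", u)][("R", v)] = float("inf")
    value, reachable = max_flow(capacity, "s", "t")
    # Cut edges: source->left for unreachable left nodes, right->sink for
    # reachable right nodes. Their endpoints form the cover.
    left_cover = {u for u in left_weight if ("L", u) not in reachable}
    right_cover = {v for v in right_weight if ("R", v) in reachable}
    return value, left_cover, right_cover
```

In a small example with left weights `{a: 3, b: 1}`, right weights `{x: 1, y: 1}`, and edges `a-x`, `a-y`, `b-y`, the cheapest cover is `{x, y}` with weight 2.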

¹ A vertex cover (VC) is a set of nodes $S$ such that every edge has at least one end point in $S$.

² $U$ is an independent set (IS) of a graph if and only if no two nodes in $U$ are connected by an edge.

³ Nodes that represent the same variable are simply “merged”, along with their edges, and every “composite” node is given a weight equal to the sum of the individual weights of the merged nodes.

Figure 2: Shows a WCSP over 3 Boolean variables. The constraint network is shown in the top-left cell, and the 6 binary and unary weighted constraints are shown along with their lifted graphical representations in the 1st and 2nd rows. The CCG is shown in the bottom-right cell.



Figure 3: Lifted graphical representations for different kinds of terms. (a) represents a linear term, either positive or negative, where $w_1$ and $w_2$ are chosen appropriately. (b) represents a negative quadratic term. (c) represents a negative cubic term. (d) illustrates the “change of variable” method and essentially represents a leading positive cubic term. (e) is the same “flower” structure shown in (c), but with a “thorn” introduced for each variable. The resulting graph is also bipartite, because the auxiliary variable $A$ can be moved to the same partition as the original variables. The graph now represents the term $-w(1 - X_i)(1 - X_j)(1 - X_k)$, which, in effect, is a bipartite representation for an expression with a leading positive cubic term.


Finally, Boolean weighted constraints can be represented as multivariate polynomials on the variables participating in that constraint [31, 32]. The coefficients of the polynomial can be computed with a standard Gaussian Elimination procedure for solving systems of linear equations. The linear equations themselves arise from substituting different combinations of values to the variables, and equating them to the corresponding entries in the weighted constraint. One way to build the CCG of a given weighted constraint is: (a) build the graphical representations for each of the individual terms in the multivariate polynomial; and (b) “merge” these graphical representations [32].
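As an illustration of the polynomial representation, the sketch below recovers the multilinear coefficients of a Boolean weighted constraint. Note the swapped-in technique: instead of general Gaussian elimination, it uses Möbius inversion over subsets, which solves the same (triangular) linear system in closed form. The function names and the cost-table representation are assumptions made here.

```python
from itertools import product


def polynomial_coefficients(cost, arity):
    """Map each monomial (tuple of variable indices) to its coefficient.

    The result satisfies cost[x] == sum of coefficients of all monomials
    whose variables are all set to 1 in x.
    """
    points = list(product((0, 1), repeat=arity))
    coeffs = {}
    for t in points:
        monomial = tuple(i for i in range(arity) if t[i])
        total = 0
        for s in points:
            if all(a <= b for a, b in zip(s, t)):  # s is a subset of t
                sign = (-1) ** (sum(t) - sum(s))
                total += sign * cost[s]
        coeffs[monomial] = total
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the multilinear polynomial at the 0/1 point x."""
    return sum(c for mono, c in coeffs.items() if all(x[i] for i in mono))
```

For the binary constraint with costs 2, 0, 1, 3 on `(0,0)`, `(0,1)`, `(1,0)`, `(1,1)`, this yields the polynomial $2 - X_0 - 2X_1 + 4X_0X_1$.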

5.4 Submodular Constraints with bounded arity 

The focus of [34] is on bounded arity submodular constraints (that is, submodular constraints with arity at most $K$, for some constant $K$) and on providing asymptotically improved algorithms for solving them. These submodular constraints can be solved more efficiently because the underlying max-flow problems are staged on bipartite graphs. For Boolean WCSPs with arity at most $K$, the bipartite CCG has $N$ nodes in one partition, at most $2^K M$ nodes in the other partition, and at most $K 2^K M$ edges. For $K$ bounded by a constant, this results in a time complexity of $O(NM \log M)$. This significantly improves on the $O((N + M)^3)$ time complexity of the previously provided algorithm.⁴ Figure 3 shows the lifted graphical representation for all possible terms of a constraint of arity 3.

⁴ For arity $K$, $M$ could be as large as $\binom{N}{K}$.

5.5 Social Influence

With the increasing popularity of online social network websites and apps, such as Facebook and Twitter, social networks now play a fundamental role as a medium for people to share, exchange, and obtain new ideas and information. Modeling, utilizing, and understanding social influence has become a “hot topic” in computer science, machine learning, and computational social science [35, 36, 37, 38, 39, 40, 41, 42]. It turns out that submodular functions, and especially submodular function maximization, play fundamental roles in solving algorithmic questions associated with social influence. In this section, we mainly focus on using submodular function maximization techniques to solve two problems related to social influence, namely influence maximization [36] and network inference [35].

5.5.1 Influence Maximization 

Assume that now a company wants to promote its new product among the individuals in a social network (real or 
virtual). The company has a limited budget to give free samples of their new product to the users in the social network. 
A natural question to ask is to which set of users should the company give the free sample products, such that the 
overall adoption of the new product can be maximized. The question is exactly the influence maximization problem, 
namely selecting a small set of seed nodes in a social network such that its overall influence coverage is maximized 
under a certain diffusion model. 

Among the many diffusion models, the Independent Cascade (IC) model and the Linear Threshold (LT) model are used widely in the study of influence maximization [36]. Both the IC and LT models are stochastic models characterizing how influence propagates throughout the network, starting from the initial seed nodes.

For influence maximization, the objective function $\sigma(S)$, where $S$ is the initial seed set, is the expected number of activated nodes under the diffusion model. The problem is simply to maximize $\sigma(S)$ subject to the cardinality constraint $|S| \leq k$.

It has been shown that the influence maximization problem under both the IC model and the LT model is NP-hard [36]. However, by the following theorem, the problem admits an efficient approximation algorithm.

Theorem 5.1 (Kempe et al. [36]) The objective function of the influence maximization problem under both the IC and LT models is non-negative, monotone and submodular.

By the classic greedy algorithm for monotone submodular function maximization, a $(1 - 1/e)$ approximation guarantee can be achieved. The proof of the theorem relies on the fact that a conic combination of submodular functions is also submodular. The objective function is an expectation and can be written as

\[
\sigma(S) = \sum_{\text{outcome } X} \mathrm{Prob}[X] \, \sigma(S \mid X),
\]

where $X$ is any realization of the stochastic diffusion process. A reachability argument for both the IC model and the LT model can be used to show that $\sigma(S \mid X)$ is submodular under any $X$.
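The reachability view suggests a simple way to estimate $\sigma(S)$ under the IC model: sample realizations in which each edge is independently “live” with its activation probability, count the nodes reachable from the seeds in each, and average. The sketch below pairs this estimate with the plain greedy algorithm. It is a Monte Carlo illustration only; the function names, the edge-dictionary representation, and the number of samples are assumptions made here.

```python
import random
from collections import deque


def sample_live_edges(edges, rng):
    """One IC realization: keep each edge (u, v) independently with prob p."""
    return [e for e, p in edges.items() if rng.random() < p]


def reachable(seeds, live_edges):
    """Nodes reachable from `seeds` along live edges (sigma(S | X) is its size)."""
    adj = {}
    for u, v in live_edges:
        adj.setdefault(u, []).append(v)
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def estimate_spread(seeds, realizations):
    """Average reachable-set size over sampled realizations."""
    return sum(len(reachable(seeds, live)) for live in realizations) / len(realizations)


def greedy_seeds(nodes, realizations, k):
    """Pick k seeds (k <= number of nodes), each maximizing estimated spread."""
    seeds = []
    for _ in range(k):
        candidates = [v for v in nodes if v not in seeds]
        best = max(candidates, key=lambda v: estimate_spread(seeds + [v], realizations))
        seeds.append(best)
    return seeds
```

Reusing the same sampled realizations for every evaluation keeps the estimated objective itself a conic combination of coverage-style functions, so the greedy guarantee applies to the estimate.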

Though the greedy algorithm can solve the problem approximately in polynomial time, its key step, i.e., evaluation of the marginal gain $\sigma(S \cup \{u\}) - \sigma(S)$, can take a long time for large networks. Many papers have been published on how to improve the efficiency of the algorithm [43, 44, 45, 46, 47], such as by lazy evaluation [44, 45], or by approximate evaluation of the marginal gain [43, 46, 47].
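Lazy evaluation exploits submodularity directly: a marginal gain computed in an earlier round is an upper bound on the current gain, so an element only needs to be re-evaluated when it reaches the top of a priority queue. The following is a generic sketch of this idea for any monotone submodular set function `f` given as a callable on lists; it is not the implementation from the cited papers, and the names are assumptions made here.

```python
import heapq


def lazy_greedy(ground_set, f, k):
    """Greedy maximization with lazy marginal-gain evaluation.

    Returns (selected elements, number of marginal-gain evaluations).
    Assumes f is monotone and submodular.
    """
    selected = []
    current = f(selected)
    # Heap entries: (-gain, tiebreak index, element, |selected| when computed).
    heap = [(-(f([v]) - current), i, v, 0) for i, v in enumerate(ground_set)]
    heapq.heapify(heap)
    evaluations = len(heap)
    while len(selected) < k and heap:
        neg_gain, i, v, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # Gain is fresh and beats every (upper-bound) gain in the heap.
            selected.append(v)
            current += -neg_gain
        else:
            # Stale upper bound: recompute and push back.
            gain = f(selected + [v]) - current
            evaluations += 1
            heapq.heappush(heap, (-gain, i, v, len(selected)))
    return selected, evaluations
```

On a small set-cover style objective with four candidate sets and $k = 2$, the lazy variant needs 6 gain evaluations where the plain greedy algorithm would need 7; the savings grow with the size of the ground set.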

A more general result for Generalized Linear Threshold models, which generalizes the results for the IC and LT models, has been proved in [48]. The proof uses a sophisticated stage-wise coupling argument to show that submodularity applies. The idea of the proof is to add the initial seeds and propagate the influence stage by stage. The key component in the proof is the anti-sense coupling used in the last stage.

An extension to influence maximization that has drawn much attention recently is to solve this problem under competitive influence. Competitive influence implies that two or more competing products or ideas are propagating simultaneously in the social network. The influence maximization problem naturally extends to maximizing one’s own influence [38, 39, 40] or minimizing the influence of the competitors [41, 42], given the choices of the initial seeds of the competitors. For example, on the maximization side, [49] study influence maximization when a user can dislike the product and propagate bad news about it. On the minimization side, [50] study the idea of influence blocking maximization, which focuses on selecting seeds to block the propagation of rumors. Both approaches solve the optimization problem by showing that the objective function is monotone and submodular. The proof technique is similar to that in [36]. However, their arguments are much more complicated due to the interaction of competitive diffusion.

5.5.2 Network Inference


The influence maximization problem takes as its input a social network structure with the strength of influence on each edge. However, in most cases, the underlying network that enables the diffusion is hidden (e.g., networks of who influenced whom). The most common observations of information diffusion are only the activation times for individuals in the social network (e.g., the time stamps of when a person posted a blog containing certain information or retweeted another user’s tweet, when a user bought a certain product in viral marketing applications, etc.). The network inference problem focuses on discovering the diffusion network from the observed cascades occurring among the individuals in a social network. Existing approaches to this problem solve a maximum likelihood estimation problem with respect to the network structure under certain diffusion models [35, 37, 51, 52]. It turns out that the likelihood function of this problem can be approximated with a submodular function. Thus, submodular function maximization can be used to solve this problem [35, 37].

The extended IC model is used as the diffusion model in [35, 37]. In the extended IC model, each edge is associated with an activation probability. Moreover, each activation has a time delay. For example, if node $v$ is activated at time $t_v$ and the activation attempt on $v$’s neighbour $u$ succeeds, then $u$ will become activated at time $t_u = t_v + \Delta t$, where $\Delta t$ is the time delay. In the model, the delay time follows an exponential distribution or a power-law distribution, namely

\[
P_d(\Delta t) \propto e^{-\Delta t / \alpha} \quad \text{or} \quad P_d(\Delta t) \propto (\Delta t)^{-\alpha}.
\]

Then, according to the model, if $v$ is activated at time $t_v$ and succeeds in activating node $u$, which becomes activated at time $t_u$, the probability that this activation occurs is

\[
P_c(v, u) = P_d(t_u - t_v) \, p_{vu},
\]

where $p_{vu}$ is the activation probability of the edge $(v, u)$. The model also assumes that influence can only propagate forward in time, which means $P_c(v, u) = 0$ if $t_v \geq t_u$. Then, if the pattern of cascade $c$ forms a tree $T$, the probability that the cascade is observed given the tree is

\[
P(c \mid T) = \prod_{(i,j) \in T} P_c(i, j).
\]

In addition, if we assume that the who-infects-whom relation forms a tree pattern (each node is activated by only one other node), then, given a certain $G$, the probability that we observe the cascade $c$ would be

\[
P(c \mid G) = \sum_{T \in \mathcal{T}(G)} P(c \mid T) P(T \mid G) \propto \sum_{T \in \mathcal{T}(G)} \prod_{(i,j) \in T} P_c(i, j),
\]

where $\mathcal{T}(G)$ is the set of all directed spanning trees on $G$. Therefore, if we have observed a set of cascades $C = \{c_1, c_2, \ldots\}$, the probability of observing all these cascades is

\[
P(C \mid G) = \prod_{c \in C} P(c \mid G).
\]

Under this configuration, the network inference problem is to find a graph $G = (V, E)$ with at most $k$ edges such that

\[
\hat{G} = \arg\max_{|E| \leq k} P(C \mid G).
\]

In the objective function, we sum over all spanning trees of the graph $G$, the number of which is super-exponential. In order to make this computation feasible, we instead solve an approximation in which only the spanning tree with the maximal likelihood is considered, namely

\[
P(C \mid G) = \prod_{c \in C} \max_{T \in \mathcal{T}(G)} P(c \mid T) = \prod_{c \in C} \max_{T \in \mathcal{T}(G)} \prod_{(i,j) \in T} P_c(i, j).
\]

Then we define $F_c(G)$ as the difference between the log likelihood of cascade $c$ over graph $G$ and over the empty graph $K$:

\[
F_c(G) = \max_{T \in \mathcal{T}(G)} \log P(c \mid T) - \max_{T \in \mathcal{T}(K)} \log P(c \mid T).
\]

Taking the sum over all the cascades, we have

\[
F_C(G) = \sum_{c \in C} F_c(G).
\]

$K$ is a graph with all the nodes in $G$ and also an extra node $m$. The only edges in $K$ are the edges from $m$ to every other node, with activation probability $\epsilon$ and delay 0. The extra node represents the external influence. Then the optimization problem can be rewritten as

\[
G^* = \arg\max_{|E| \leq k} F_C(G) \qquad (3)
\]

[35] proved that this objective function is monotone and submodular. Therefore, a greedy algorithm can achieve a $(1 - 1/e)$ approximation for solving it. This approach was later improved by the MultiTree algorithm in [37], where the matrix tree theorem is used to calculate the exact summation over all possible spanning trees, rather than using the maximum spanning tree approximation.
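The greedy edge-selection procedure for the maximum-tree objective can be sketched as follows. The sketch rests on several assumptions that are not taken from the cited papers: an exponential delay model with a single rate parameter `alpha`, one shared activation probability `beta` for every candidate edge, and the observation that, because influence only flows forward in time, the most likely tree for a cascade is obtained by letting each activated node independently pick its most likely parent (either an earlier-activated neighbour in $G$ or the external node with probability `eps`). All function names are hypothetical.

```python
import math


def edge_log_prob(t_v, t_u, beta, alpha):
    """log P_c(v, u) under an exponential delay; -inf if not forward in time."""
    if t_v >= t_u:
        return -math.inf
    return math.log(beta) - math.log(alpha) - (t_u - t_v) / alpha


def cascade_gain(edges, cascade, eps, beta, alpha):
    """F_c(G): log-likelihood improvement of the best tree over external-only."""
    base = math.log(eps)
    total = 0.0
    for u, t_u in cascade.items():
        best = base  # the external node is always an available parent
        for v, w in edges:
            if w == u and v in cascade:
                best = max(best, edge_log_prob(cascade[v], t_u, beta, alpha))
        total += best - base
    return total


def objective(edges, cascades, eps=1e-3, beta=0.5, alpha=1.0):
    """F_C(G): sum of F_c(G) over all observed cascades."""
    return sum(cascade_gain(edges, c, eps, beta, alpha) for c in cascades)


def greedy_network(candidates, cascades, k, **params):
    """Add up to k directed edges, each with the largest positive marginal gain."""
    chosen = []
    for _ in range(k):
        current = objective(chosen, cascades, **params)
        best_edge, best_gain = None, 0.0
        for e in candidates:
            if e in chosen:
                continue
            gain = objective(chosen + [e], cascades, **params) - current
            if gain > best_gain:
                best_edge, best_gain = e, gain
        if best_edge is None:
            break  # no remaining edge improves the objective
        chosen.append(best_edge)
    return chosen
```

Each cascade is a dict from node to activation time. With cascades that repeatedly show `a` before `b` and `b` before `c`, the procedure recovers the edges `(a, b)` and `(b, c)` and declines to add the redundant edge `(a, c)`.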

In an ongoing project by Xinran He with Prof. Yan Liu, we are experimenting with using Maximum a Posteriori inference to solve the network inference problem. The previous approaches assume no prior knowledge about the structure of the inferred graph. However, it has been shown repeatedly that social networks have many unique properties, including heavy-tailed degree distributions, small diameter, community structure, and so on. We propose a social network generative model over the diffusion network as a prior to incorporate the prior knowledge about network structure. Our current choice is the Kronecker graphs model [53, 54]. The Kronecker graphs model is a parametric model which can provide a probability for the existence of each edge in the social network. The existence of each edge is considered independent under this model. Using this model as a prior, we can change the objective function $F_C(G)$ in Equation 3 to

\[
F'_C(G) = F_C(G) + \sum_{e \in E} \left( \log \mathrm{Prob}[e \text{ exists}] - \log \mathrm{Prob}[e \text{ not exists}] \right).
\]

After adding a modular function to a submodular function, the resulting $F'_C(G)$ is still submodular; however, it may not necessarily be monotone any more. As a result, the simple greedy algorithm cannot be used to solve this problem. Instead, we can use the algorithm proposed in [55] with a $1/2 + o(1)$ approximation guarantee.